Abstract

We provide a tight bound on the amount of experimentation under the 
optimal strategy in sequential decision problems. We show the applicability of 
the result by providing a bound on the cut-off in a one-arm bandit problem. 
1 Introduction

A basic issue faced by the statistician in sequential decision problems is the trade-off 
between the cost of pursuing the experimentation and the informational benefit from 
doing so. For instance, in bandit problems, the decision maker chooses whether to 
pull an apparently optimal arm, or to pull some seemingly poorer one, in the hope of
thereby getting valuable information. 

Such problems lead to unwieldy analytical problems, rarely amenable to closed- 
form solutions, which is arguably one reason why sequential methods are still seldom 
relied upon in practice (see Lai (2001), Armitage (1975)). For bandit problems, while 
the optimal strategy is well characterized and consists in pulling the arm with highest 
dynamic allocation index (Gittins and Jones (1974), Gittins (1979)), the explicit 
computation of these indices is rarely feasible, except for very specific cases where 
the risky arm yields a Bernoulli payoff (see for instance Bradt, Johnson and Karlin 
(1956), Feldman (1962), Woodroofe (1979), Berry and Fristedt (1985)). 

Over the years, a number of approaches have been pursued: (i) computing ap- 
proximate solutions of the corresponding dynamic programming equation, as in Berry 
(1972) or Fabius and van Zwet (1970); (ii) relying on nearby problems for which explicit
solutions are known, as in Lai (1987); (iii) making extensive use of numerical computations,
as in Lai (1988, for sequential testing of composite hypotheses); (iv) designing
ad hoc policies, sometimes investigating their performance numerically, as in Corn- 
field, Halperin and Greenhouse (1969), Berry and Sobel (1973), Berry (1978) and, 
more recently, (v) finding explicit a priori bounds, as in Brezzi and Lai (2000). 

This note contributes to the last category. Motivated by economic applications
(see, e.g., Dixit and Pindyck (1994), Bolton and Harris (1999), Bergemann and
Välimäki (2000), Keller, Rady and Cripps (2005), Rosenberg, Solan and Vieille
(2007)), we consider general Bayesian, discounted sequential problems. The parameter
$\theta$ has an initial distribution $P$ (the belief of the economic agent). The agent
repeatedly receives some information, chooses an action from a set $A$, and gets a
possibly unobserved instantaneous reward $u(\theta, a)$. Future gains are discounted by
a discount factor $\delta \in (0, 1)$. Given a decision rule $\sigma$ and a stage $n$, we define the
amount of experimentation in stage $n$ to be the difference $\Delta_n$ between the currently
highest reward and the current reward obtained when using $\sigma$.



We show that, for every optimal decision rule, the expected value of $\sum_{n=1}^{\infty} \Delta_n$ does
not exceed $C\delta/(1-\delta)$, where $C$ is a bound¹ on the reward function $u$. The bound
is valid irrespective of the prior belief $P$, and no matter how information flows to
the decision maker. This result was used in Rosenberg, Solan and Vieille (2009) to
show that the limit payoffs of neighbors in connected social networks coincide, and
to provide conditions that ensure consensus.

We next show, by means of an example, that this bound is tight. We also illustrate 
how to use this bound in practice to derive a priori estimates for specific sequential 
problems. For simplicity, we focus on an instance of a one-arm bandit problem, for 
which no explicit solution is available, and give an estimate of the optimal boundary 
in the associated optimal stopping problem. In contrast to Brezzi and Lai (2000), 
who provide a bound on the Gittins' index in bandit problems, our bound is on the 
cut-off of the optimal strategy. 



2 Setup and Results 

The parameter set² is a measurable space $(\Theta, \mathcal{A})$, endowed with a prior distribution
$P$. At each stage $n \ge 1$, a decision maker first gets an observation drawn from
a (measurable) set $S$, then chooses an action $a$ out of a (compact metric) set $A$,
and gets a reward $u(\theta, a)$. The decision maker discounts future rewards at the rate
$\delta \in [0, 1)$. The reward function $u : \Theta \times A \to \mathbb{R}$ is (jointly) measurable, and continuous
w.r.t. $a$. In addition, we assume that the highest reward $\bar u : \theta \mapsto \max_{a \in A} u(\theta, a)$ and
the lowest reward $\underline u : \theta \mapsto \min_{a \in A} u(\theta, a)$ have finite expectation.

We stress that we place no restriction whatsoever on the nature of observations:³
e.g., they may depend, possibly in a random way, on the parameter $\theta$, and on past
observations and actions; they may or may not reveal past rewards; and they may be 
independent or not. 

¹ In particular, $\sum_{n \ge 1} \Delta_n < \infty$ a.s., hence any optimal decision rule eventually stops experimenting.
² In spite of the qualifier "parameter", our decision problems are non-parametric, since the space $\Theta$
is fully general.

³ Beyond the minimal, technical assumption that the observation in stage $n$ is drawn according
to a transition probability from $\Theta \times (S \times A)^{n-1}$ to $S$.
Note that we assume that the current reward is a deterministic function $u(\theta, a)$
of the parameter $\theta$ and of the action $a$. This assumption is made without loss of generality.
Statistical models such as multi-armed bandit problems, where the decision
maker observes a current reward that depends randomly on $\theta$ (and on $a$), can be
cast into the above framework. Indeed, it suffices to re-label such a random reward 
as the "observation" , and to define the reward to be the expectation of the "observa- 
tion". Such a change does not affect the optimal decision rules, nor the optimal value 
of the problem. 

For a decision rule⁴ $\sigma$, $P_\sigma$ is the joint distribution of $\theta$ and of the infinite sequence
of observations and decisions. Expectation w.r.t. $P_\sigma$ is denoted by $E_\sigma$.

We focus on the amount of experimentation that optimal decisions entail. To be
specific, let a decision rule $\sigma$ be given. Given a stage $n$, we denote by $\mathcal{H}_n$ the information
available at stage $n$, that is, the $\sigma$-field induced by past observations and actions.
When using the decision rule $\sigma$ prior to stage $n$, the expectation $E_\sigma[u(\theta, a) \mid \mathcal{H}_n]$ is the
expected reward when choosing $a$ in stage $n$, given all available information, and
$\bar u_n := \max_{a \in A} E_\sigma[u(\theta, a) \mid \mathcal{H}_n]$ is the myopically optimal reward. Letting $a_n$ denote
the action of the decision maker in stage $n$, $u_n = E_\sigma[u(\theta, a_n) \mid \mathcal{H}_n]$ is the actual
reward that the decision maker expects to get in stage $n$, when following $\sigma$. The
difference $\Delta_n := \bar u_n - u_n$ provides a measure of the degree of experimentation performed
in stage $n$. The infinite sum $\sum_{n \ge 1} \Delta_n$ therefore measures the overall amount of
experimentation.
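For a finite parameter set, the myopically optimal reward and the degree of experimentation defined above reduce to elementary arithmetic on the posterior belief. The following Python sketch is purely illustrative: the function name `experimentation_gap`, the reward table and the belief vector are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def experimentation_gap(belief: Sequence[float],
                        rewards: Sequence[Sequence[float]],
                        action: int) -> float:
    """Return (myopically optimal reward) - (expected reward of `action`).

    belief[i]     : posterior probability of the i-th parameter value
    rewards[i][j] : reward u(theta_i, a_j)
    """
    n_actions = len(rewards[0])
    # Expected reward of each action under the current belief.
    expected = [sum(p * row[j] for p, row in zip(belief, rewards))
                for j in range(n_actions)]
    return max(expected) - expected[action]


# Hypothetical example: two parameter values, three actions.
rewards = [[1.0, 0.0, 0.4],
           [0.0, 1.0, 0.4]]
belief = [0.7, 0.3]
gaps = [experimentation_gap(belief, rewards, j) for j in range(3)]
```

Under this belief the first action is myopically optimal, so its gap is zero, while the other two actions carry a strictly positive degree of experimentation.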

Theorem 2.1 For any optimal decision rule $\sigma$, one has

\[ E_\sigma\Big[\sum_{n \ge 1} \Delta_n\Big] \le \big(E[\bar u] - E[\underline u]\big) \times \frac{\delta}{1-\delta}. \]



Beyond quantitative implications, this bound also yields qualitative implications. 
Consider for instance a multi-arm bandit problem. For simplicity, assume that the 
types of the various arms are first drawn, and that each arm then yields a sequence 



4 That is, a sequence (a n ) of measurable functions, where a n : (S x A) n 1 x S — > A is the decision 
in stage n. 
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of rewards, which is conditionally i.i.d. given its type. For concreteness, assume that 
with probability 1 over the types, the expected outputs of the arms are all distinct. 

Observe that, whenever the decision maker pulls a specific arm infinitely often, 
she eventually learns the type of this arm. Therefore, whenever the decision maker 
pulls two specific arms infinitely often, she eventually learns both types. Since one
of these two arms is "better" than the other, this implies that the sequence $(\Delta_n)_{n \ge 1}$
then does not converge to zero. By Footnote 1, this event must have probability 0,
for every optimal decision rule. In other words: any optimal allocation rule samples
all arms but one finitely often. This provides an alternative proof of Theorem 2 in
Brezzi and Lai (2000).⁵

We next show that the bound in Theorem 2.1 is tight.

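Proposition 2.2 For every $\varepsilon > 0$ and for every discount factor $\delta$, there is a decision
problem with an optimal decision rule $\sigma$ such that

\[ E_\sigma\Big[\sum_{n \ge 1} \Delta_n\Big] \ge \big(E[\bar u] - E[\underline u]\big) \times (1 - \varepsilon)\,\frac{\delta}{1-\delta}. \]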



The decision problem in Proposition 2.2 depends both on $\varepsilon$ and on $\delta$. The next
proposition improves in this respect, at a slight cost in the speed of convergence. In
this statement, and given $\varepsilon > 0$, we denote by $N(\varepsilon)$ the (random) number of stages
in which $\Delta_n$ is at least $\varepsilon$: $N(\varepsilon) := |\{n \ge 1 : \Delta_n \ge \varepsilon\}|$. Plainly, $\sum_{n \ge 1} \Delta_n \ge \varepsilon N(\varepsilon)$ for
every $\varepsilon > 0$.

Proposition 2.3 There is a decision problem such that for every $\delta > 2/3$ there is a
unique optimal decision rule $\sigma$ that satisfies

\[ \lim_{\varepsilon \to 0} \varepsilon^{\alpha} E_\sigma[N(\varepsilon)] = +\infty, \quad \text{for every } \alpha < 1. \]

That is, as $\varepsilon$ decreases, the expected number $E_\sigma[N(\varepsilon)]$ of experimentation stages
increases faster than $1/\varepsilon^{\alpha}$, for every $\alpha < 1$.

⁵ Brezzi and Lai (2000) assume that the states of the different arms are independent. Our
argument dispenses with this assumption.
3 Proofs

3.1 Proof of Theorem 2.1

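Consider an optimal decision rule $\sigma$. Set

\[ Y_n := (1-\delta) \sum_{k=n}^{+\infty} \delta^{k-n} E_\sigma[u_k \mid \mathcal{H}_n]. \]

$Y_n$ can be interpreted as the continuation reward under the optimal decision rule (discounted
back to stage $n$). Since $u_k \le E_\sigma[\bar u \mid \mathcal{H}_k]$ for all $k \ge n$, one has $E_\sigma[Y_n] \le E[\bar u]$.
Similarly, $u_k \ge E_\sigma[\underline u \mid \mathcal{H}_k]$ for all $k \ge n$, so that $E_\sigma[Y_n] \ge E[\underline u]$.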

Since one option available to the decision maker, from stage n on, is to ignore all 
future observations, and to keep choosing the action that was myopically optimal in 
stage n, we have 

\[ Y_n \ge \bar u_n. \tag{1} \]

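Now, rewrite $Y_n$ as

\[ \begin{aligned}
Y_n &= (1-\delta)\, u_n + \delta E_\sigma[Y_{n+1} \mid \mathcal{H}_n] \\
    &= (1-\delta)(\bar u_n - \Delta_n) + \delta E_\sigma[Y_{n+1} \mid \mathcal{H}_n].
\end{aligned} \tag{2} \]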

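From (1) and (2) we obtain

\[ \bar u_n \le (1-\delta)(\bar u_n - \Delta_n) + \delta E_\sigma[Y_{n+1} \mid \mathcal{H}_n], \]

so that, after cancelling $(1-\delta)\bar u_n$ from both sides and dividing by $\delta$,

\[ \bar u_n \le E_\sigma[Y_{n+1} \mid \mathcal{H}_n] - \frac{1-\delta}{\delta}\,\Delta_n. \tag{3} \]

Substituting (3) into (2), we obtain

\[ \begin{aligned}
Y_n &\le (1-\delta)\Big(E_\sigma[Y_{n+1} \mid \mathcal{H}_n] - \frac{1-\delta}{\delta}\,\Delta_n - \Delta_n\Big) + \delta E_\sigma[Y_{n+1} \mid \mathcal{H}_n] \\
    &= E_\sigma[Y_{n+1} \mid \mathcal{H}_n] - \frac{1-\delta}{\delta}\,\Delta_n.
\end{aligned} \]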



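Taking expectations, summing over $n = 1, \dots, k$, using $E[\underline u] \le E_\sigma[Y_n] \le E[\bar u]$, and
taking the limit as $k$ goes to infinity, we obtain

\[ E_\sigma\Big[\sum_{n \ge 1} \Delta_n\Big] \le \big(E[\bar u] - E[\underline u]\big) \times \frac{\delta}{1-\delta}, \]

as desired.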
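The summation step can be spelled out as follows. This is only a sketch of the telescoping argument; it starts from the one-step inequality $E_\sigma[Y_n] \le E_\sigma[Y_{n+1}] - \frac{1-\delta}{\delta} E_\sigma[\Delta_n]$, obtained by taking expectations in the last display of the proof (and thus assumes $\delta > 0$).

```latex
\begin{align*}
\frac{1-\delta}{\delta} \sum_{n=1}^{k} E_\sigma[\Delta_n]
  &\le \sum_{n=1}^{k} \bigl( E_\sigma[Y_{n+1}] - E_\sigma[Y_n] \bigr)
   = E_\sigma[Y_{k+1}] - E_\sigma[Y_1] \\
  &\le E[\bar u] - E[\underline u].
\end{align*}
```

Since $\Delta_n \ge 0$, letting $k \to \infty$ and applying monotone convergence gives the bound of Theorem 2.1.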
3.2 Proof of Proposition 2.2

Fix $\delta > 0$. Note that if the statement holds for $\varepsilon_0$, then it holds for every $\varepsilon > \varepsilon_0$. We
will prove that the statement holds for $\varepsilon = 1/m$, for any natural number $m > 1/\delta$.
Let $\Theta = \{\theta_1, \theta_2, \dots, \theta_m\}$ and $A = \{a_0, a_1, \dots, a_m\}$ contain $m$ and $m + 1$ elements
respectively. The prior belief on $\Theta$ is uniform, and the reward function is given by:

\begin{align}
u(\theta_k, a_k) &= 1, \quad k = 1, \dots, m, \tag{4} \\
u(\theta_k, a_l) &= 0, \quad k = 1, \dots, m, \ l \ne k, \ l \ge 1, \tag{5} \\
u(\theta_k, a_0) &= 0, \quad k = 1, \dots, m. \tag{6}
\end{align}

Thus, once the parameter is inferred with certainty, there is a unique optimal action,
whereas ex ante, all actions $a_1, \dots, a_m$ are myopically optimal, while $a_0$ is $(1/m)$-suboptimal.

Information is provided to the decision maker according to the following rules: if
the decision maker has chosen $a_0$ in all previous stages, the true parameter is revealed
with probability $c := \frac{1-\delta}{(m-1)\delta} < 1$; if the decision maker did not choose $a_0$ in all previous
stages, no information is revealed, that is, no observation is made. Suppose the
decision maker chooses $a_0$ until the state of the world is revealed, and then switches
to the optimal action. The expected reward $\lambda$ satisfies $\lambda = c\delta + (1-c)\delta\lambda$, so that
$\lambda = \frac{c\delta}{1-(1-c)\delta}$. Substituting $c = \frac{1-\delta}{(m-1)\delta}$, we obtain that the expected reward is $1/m$,
which is also the reward from playing any myopically optimal action forever, so
that this strategy is optimal. However, for $\varepsilon = 1/m$ one has:

\[ E_\sigma\Big[\sum_{n \ge 1} \Delta_n\Big] = E_\sigma[\varepsilon N(\varepsilon)] = \frac{\varepsilon}{c} = \frac{m-1}{m} \cdot \frac{\delta}{1-\delta}. \]

Since $\bar u = 1$ and $\underline u = 0$ we get the desired result.
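The arithmetic of this example can be checked exactly with rational numbers. The sketch below is illustrative: the function name `tightness_example` and the chosen values of $m$ and $\delta$ are hypothetical, and the revelation probability is taken to be $c = (1-\delta)/((m-1)\delta)$, the value for which the recursion for the expected reward yields $1/m$.

```python
from fractions import Fraction


def tightness_example(m: int, delta: Fraction):
    """Exact quantities for the example with m parameter values and discount delta."""
    if not (0 < delta < 1) or m * delta <= 1:
        raise ValueError("need 0 < delta < 1 and m > 1/delta")
    # Probability that the parameter is revealed while a_0 is still being played.
    c = (1 - delta) / ((m - 1) * delta)
    # Solution of lam = c*delta + (1 - c)*delta*lam.
    value = c * delta / (1 - (1 - c) * delta)
    # Each stage in which a_0 is played contributes 1/m; their expected number is 1/c.
    total_gap = Fraction(1, m) / c
    # Bound of Theorem 2.1 when E[u_bar] - E[u_low] = 1.
    bound = delta / (1 - delta)
    return c, value, total_gap, bound


c, value, total_gap, bound = tightness_example(m=10, delta=Fraction(9, 10))
```

With these inputs the expected reward equals $1/m$ and the total amount of experimentation equals $(1 - 1/m)$ times the bound, so the gap to the bound vanishes as $m$ grows.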



3.3 Proof of Proposition 2.3

We provide an example within the class of Gaussian models. Set $\Theta = \mathbb{R}$, and let
the action set $A = \mathbb{R} \cup \{-\infty, +\infty\}$ be the set of extended real numbers, endowed
with the usual topology. The reward function $u(\theta, a)$ is equal to one if $a \in \mathbb{R}$ and
$|\theta - a| < 1$, and equal to zero otherwise.

Given a normal distribution $\mu$ with precision $\rho$ (that is, with variance $1/\rho$), define
$\bar u(\rho)$ to be the highest reward that the decision maker may achieve, when holding
the belief $\mu$. Observe that $\bar u(\rho)$ does not depend on the mean of $\mu$. Plainly, the map
$\rho \mapsto \bar u(\rho)$ is continuous and increasing, with $\lim_{\rho \to 0} \bar u(\rho) = 0$, and $\lim_{\rho \to +\infty} \bar u(\rho) = 1$.

The signalling structure of the decision problem is designed in such a way that 
the decision maker's belief is always a normal distribution. In addition, she keeps 
receiving additional information about 9 as long as she follows a pre-specified sequence 
of suboptimal actions. 

To be specific, let $(\varepsilon_n)_{n \ge 1}$ be a decreasing sequence of positive numbers that
satisfies (i) $\sum_{n=1}^{\infty} \varepsilon_n \in (1/2, 1)$, (ii) $\varepsilon_n n^{\beta} \to +\infty$ for every $\beta > 1$, and⁶ (iii) $\varepsilon_{n+1}/\varepsilon_n \ge 1/2$.
The sequence $(\rho_n)_{n \ge 1}$ is defined recursively by the condition

\[ \bar u(\rho_1 + \dots + \rho_n) = \varepsilon_1 + \dots + \varepsilon_n. \]

Let the prior distribution $P$ be a normal distribution with precision $\rho_1$, and let
$(\zeta_n)_{n \ge 2}$ be a sequence of independent normally distributed variables with precision
$\rho_n$, and independent from $\theta$.

Observe that, in the absence of any information about $\theta$, the decision maker's
myopically optimal reward is $\bar u(\rho_1) = \varepsilon_1$. We set $a_1 = +\infty$. On the other hand, if
she receives the observations $s_k := \theta + \zeta_k$, $k = 2, \dots, n$ ($n \ge 2$), her belief over $\theta$
is normally distributed, with precision $\rho_1 + \dots + \rho_n$. Hence, her myopically optimal
reward is $\bar u(\rho_1 + \dots + \rho_n) = \varepsilon_1 + \dots + \varepsilon_n$, and there is an action $a_n$ (which depends
on $s_2, \dots, s_n$), which yields an expected reward equal to $\varepsilon_1 + \dots + \varepsilon_{n-1}$.

We now define the information received by the decision maker: 

• Prior to stage 1, the decision maker receives no observation; 

• Prior to stage 2, she receives the observation $s_2 = \theta + \zeta_2$ if she played $a_1 = +\infty$
at the first stage, and no observation otherwise;

• Prior to stage $n > 2$, she receives the observation $s_n = \theta + \zeta_n$ if she played
$a_1, a_2, \dots, a_{n-1}$ at the previous stages. Otherwise, she receives no observation.

⁶ For instance, choose $\varepsilon_n = \dots$ for $n$ sufficiently large.
Playing the sequence $(a_n)$ of actions is the unique optimal decision rule. Indeed,
if the decision maker first deviates from that sequence at stage $k \ge 1$, she receives no
further information, hence her optimal reward in all later stages is $\varepsilon_1 + \dots + \varepsilon_k$; if she
sticks to the sequence $(a_n)$, her continuation reward (discounted back to stage $k$) is

\[ (1-\delta) \sum_{n=k}^{\infty} \delta^{n-k} (\varepsilon_1 + \dots + \varepsilon_{n-1}). \]

By (iii), this reward is higher than $\varepsilon_1 + \dots + \varepsilon_k$.

Note that $a_n$ is (myopically) $\varepsilon_n$-optimal, for each $n \ge 1$. Since the sequence $(\varepsilon_n)$
is decreasing, there are exactly $n$ rounds in which the decision maker does not play
a myopically $\varepsilon_n$-optimal action, so that by (ii) $(\varepsilon_n)^{\alpha} N(\varepsilon_n) = n(\varepsilon_n)^{\alpha}$ converges to
infinity for every $\alpha < 1$.
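The comparison between following the prescribed sequence and deviating at stage $k$ can be illustrated numerically. The sketch below rests on two assumptions that should be read as such: condition (iii) is taken to be the ratio condition $\varepsilon_{n+1}/\varepsilon_n \ge 1/2$, and the continuation reward is taken to be $(1-\delta)\sum_{n \ge k} \delta^{n-k}(\varepsilon_1 + \dots + \varepsilon_{n-1})$. Under these assumptions the difference equals $\sum_{j \ge k} \delta^{j-k+1}\varepsilon_j - \varepsilon_k \ge \varepsilon_k\,(2\delta/(2-\delta) - 1)$, which is positive exactly when $\delta > 2/3$. The function name `stick_minus_deviate` is hypothetical, and the geometric test sequence is used only as the extreme case of the ratio condition; it does not satisfy (ii).

```python
def stick_minus_deviate(eps, delta, k):
    """Reward from following (a_n) from stage k on, minus reward from deviating at stage k.

    eps[i] holds eps_{i+1}. The discounted sum is truncated at len(eps),
    which can only lower the computed continuation reward.
    """
    prefix = [0.0]
    for e in eps:
        prefix.append(prefix[-1] + e)      # prefix[n] = eps_1 + ... + eps_n
    n_max = len(eps)
    # Action a_n yields expected reward eps_1 + ... + eps_{n-1}.
    stick = (1 - delta) * sum(delta ** (n - k) * prefix[n - 1]
                              for n in range(k, n_max + 1))
    # After a deviation at stage k, the best available reward is eps_1 + ... + eps_k forever.
    deviate = prefix[k]
    return stick - deviate


# Extreme case of the ratio condition: eps_{n+1} = eps_n / 2, with total sum 0.6.
eps = [0.3 * 0.5 ** i for i in range(200)]
gain_high_delta = stick_minus_deviate(eps, 0.7, 1)   # delta above 2/3
gain_low_delta = stick_minus_deviate(eps, 0.6, 1)    # delta below 2/3
```

In this extreme case the gain from continuing is positive for $\delta = 0.7$ and negative for $\delta = 0.6$, in line with the threshold $2/3$ in Proposition 2.3.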

4 Application 

We here illustrate how Theorem 2.1 can be used to derive a priori bounds on the
optimal decision rules in specific decision problems. Since our goal is here purely 
illustrative, we restrict ourselves to the analysis of a specific one-arm bandit prob- 
lem, where the risky arm has two possible types, a good type and a bad type, and 
observations are i.i.d. In such a problem, the optimal decision rule consists of pulling 
the risky arm as long as the posterior probability assigned to the good type exceeds 
a specific cut-off, and then in switching permanently to the safe arm. 

We set the problem so as to depart as little as possible from a Bernoulli problem, 
for which a closed form expression for the optimal cut-off is known. We also make no 
attempt at optimizing our final bound. 

The type $\theta$ of the risky arm takes values in the two-point set $\{\theta_0, \theta_1\}$. Both types
are ex ante equally likely. The safe arm yields zero. Given $\theta = \theta_i$, the risky arm
may yield three different rewards, $a$, $b$ and $c$, with probabilities $p_a^i$, $p_b^i$ and $p_c^i$. These
probabilities are such that (i) the expected reward of the risky arm is $1$ if $\theta = \theta_1$, and
$-1$ if $\theta = \theta_0$; (ii) one has $\ln\frac{p_a^1}{p_a^0} = \alpha$, $\ln\frac{p_b^1}{p_b^0} = 2\alpha$, and $\ln\frac{p_c^1}{p_c^0} = -\alpha$, for some $\alpha > 0$.
Denote by $\pi_n$ the posterior belief that $\theta = \theta_1$, based on all observations prior to
stage $n$, and let $Z_n = \ln\frac{\pi_n}{1-\pi_n}$ be the log-likelihood ratio. Conditional on $\theta = \theta_0$,
the sequence $(Z_n)$ follows a random walk, which moves up by $\alpha$ (with probability $p_a^0$),
by $2\alpha$ (with probability $p_b^0$), or moves down by $\alpha$ (with probability $p_c^0$) between any two stages.

The optimal decision rule consists in pulling the risky arm until the first stage $n$
where $Z_n = -k^*\alpha$, for some $k^* \in \mathbb{N}$, and then in pulling the safe arm repeatedly. We
will derive an upper bound on $k^*$ using Theorem 2.1.

The amount of experimentation in stage $n$ is $\Delta_n = \max\{0, 1/2 - \pi_n\}$. For $k < k^*$,
let $N(k)$ be the number of passages of the sequence $(Z_n)$ at the level $-k\alpha$, and denote
by $\varepsilon(k) = \frac{1}{2} - \frac{e^{-k\alpha}}{1+e^{-k\alpha}}$ the corresponding value of $\Delta_n$. Thus,

\[ \sum_{n=1}^{\infty} \Delta_n = \sum_{k<k^*} \varepsilon(k)\, N(k). \tag{7} \]
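The correspondence between the log-likelihood ratio, the posterior belief and the amount of experimentation at a given level can be sketched in a few lines of Python. The function names and the step size used below are hypothetical; the only facts used are the definition $Z_n = \ln\frac{\pi_n}{1-\pi_n}$ and the expression $\max\{0, 1/2 - \pi_n\}$ given in the text.

```python
import math


def posterior_from_llr(z: float) -> float:
    """Invert Z = ln(pi / (1 - pi)) to recover the posterior pi."""
    return 1.0 / (1.0 + math.exp(-z))


def gap_at_level(k: int, alpha: float) -> float:
    """Value of max(0, 1/2 - pi_n) when the log-likelihood ratio equals -k * alpha."""
    return max(0.0, 0.5 - posterior_from_llr(-k * alpha))


alpha = 0.4   # hypothetical step size of the random walk
levels = [gap_at_level(k, alpha) for k in range(6)]
```

The resulting values coincide with $\frac{1}{2}\tanh\frac{k\alpha}{2}$: they vanish at level $0$ and increase towards $1/2$ as $k$ grows.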



Observe now that whenever $Z_n = -k\alpha$, the expected number of visits (including
stage $n$) to $-k\alpha$ before $Z_n$ moves below $-k\alpha$ is $1/(1 - \dots)$. On the other hand, it is
then the case that the sequence $(Z_n)$ moves down to $-(k+1)\alpha$. Hence, the probability
that $(Z_n)$ will move back to $-k\alpha$ before hitting $-k^*\alpha$ is⁷ at least $p_a^0$. Therefore,

\[ E_{\theta_0}[N(k)] \ge \dots \tag{8} \]

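By Theorem 2.1 one has $E_{\theta_0}\big[\sum_{n=1}^{\infty} \Delta_n\big] \le \frac{2\delta}{1-\delta}$. Therefore, (7)
and (8) yield

\[ \dots \times \sum_{k=0}^{k^*-1} \Big( \frac{1}{2} - \frac{e^{-k\alpha}}{1+e^{-k\alpha}} \Big) \le \frac{2\delta}{1-\delta}. \tag{9} \]

By monotonicity of $\varepsilon(k) = \frac{1}{2}\tanh\frac{k\alpha}{2}$, the sum on the left-hand side of (9) is at least equal to

\[ \int_0^{k^*-1} \frac{1}{2}\tanh\frac{x\alpha}{2}\,dx = \frac{1}{\alpha}\ln\cosh\frac{\alpha(k^*-1)}{2} \ge \frac{1}{\alpha}\ln\frac{e^{\alpha(k^*-1)/2}}{2} = \frac{k^*-1}{2} - \frac{\ln 2}{\alpha}. \]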



⁷ This bound is admittedly very crude.
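The comparison between the sum of the per-level amounts of experimentation and its integral lower bound can be checked numerically. The sketch below assumes that the per-level amount equals $\frac{1}{2}\tanh\frac{k\alpha}{2}$ (equivalently $\frac{1}{2} - \frac{1}{1+e^{k\alpha}}$); the function names and the tested values of the step size and of the cut-off are hypothetical.

```python
import math


def sum_of_gaps(k_star: int, alpha: float) -> float:
    """Sum over k = 0, ..., k_star - 1 of (1/2) * tanh(k * alpha / 2)."""
    return sum(0.5 * math.tanh(k * alpha / 2) for k in range(k_star))


def integral_lower_bound(k_star: int, alpha: float) -> float:
    """Integral of (1/2) * tanh(x * alpha / 2) over [0, k_star - 1]."""
    return math.log(math.cosh(alpha * (k_star - 1) / 2)) / alpha


def linear_lower_bound(k_star: int, alpha: float) -> float:
    """Lower bound obtained from cosh(x) >= exp(x) / 2."""
    return (k_star - 1) / 2 - math.log(2) / alpha


checks = [(alpha, k_star,
           sum_of_gaps(k_star, alpha),
           integral_lower_bound(k_star, alpha),
           linear_lower_bound(k_star, alpha))
          for alpha in (0.1, 0.5, 2.0)
          for k_star in range(1, 40)]
```

For every tested pair the three quantities are ordered as expected, the sum being the largest and the linear expression the smallest.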
Thus,

\[ k^* \le 1 + \frac{2\ln 2}{\alpha} + \dots \]